Week 4: Overfitting/MCMC

MCMC

At this point in the term, we’ll be deviating in our code from McElreath. His course is taught entirely using rethinking, which is a pedogigical tool. It has clear mapping between mathematical models and syntax. But it lacks flexibility and has fewer modeling options.

On the other hand, there is a package called brms that also does Bayesian modeling. This package uses syntax simliar to lme4 (if you’ve used that), supports a wider range of distributions, integrates with the tidyverse ecosystem (if you’ve used that), has more extensive documentation, is more actively maintained, is more widely used (i.e., more support), and is more suitable for complex models.

You’re welcome to use the rethinking package when it suits you, in this course and in your research, but my goal is to introduce you to the brms package. Instead of reviewing the code from McElreath’s lecture today, we’ll be revisiting some familiar models using brms.

model specification

Let’s return to our old friend, the globe-tossing experiment designed to estimate how much of the world is covered with water. Here is our mathematical model:

\[\begin{align*} W &\sim \text{Binomial}(N,p) \\ p &\sim \text{Uniform}(0,1) \\ \end{align*}\]

As a reminder, we have tossed the globe 9 times and landed on water 6 of those times. Here’s how we would use brms to fit our mathematical model with our data.

m42.1 <-
  brm(data = list(w = 6), 
      family = binomial(link = "identity"),
      w | trials(9) ~ 0 + Intercept,
      # this is a flat prior bounded by 0 and 1
      prior(beta(1, 1), class = b, lb = 0, ub = 1),
      iter = 5000, warmup = 1000, chains = 4,
      seed = 3,
      file = here("fits/m42.1"))
m42.1 <-
  brm(data = list(w = 6), 
      family = binomial(link = "identity"),
      w | trials(9) ~ 0 + Intercept,
      # this is a flat prior bounded by 0 and 1
      prior(beta(1, 1), class = b, lb = 0, ub = 1),
      iter = 5000, warmup = 1000, chains = 4,
      seed = 3,
      file = here("fits/m42.1"))

brm() is the core function for fitting Bayesian models using brms.

m42.1 <-
  brm(data = list(w = 6), 
      family = binomial(link = "identity"),
      w | trials(9) ~ 0 + Intercept,
      # this is a flat prior bounded by 0 and 1
      prior(beta(1, 1), class = b, lb = 0, ub = 1),
      iter = 5000, warmup = 1000, chains = 4,
      seed = 3,
      file = here("fits/m42.1"))

family specifies the distribution of the outcome family. In this example, we’re working with a binary outcome (water or land), so we’ll use the binomial distribution. In many examples, we’ll use a gaussian (normal) distribution. But there are many many many options for this.

m42.1 <-
  brm(data = list(w = 6), 
      family = binomial(link = "identity"),
      w | trials(9) ~ 0 + Intercept,
      # this is a flat prior bounded by 0 and 1
      prior(beta(1, 1), class = b, lb = 0, ub = 1),
      iter = 5000, warmup = 1000, chains = 4,
      seed = 3,
      file = here("fits/m42.1"))

The formula argument is what you would expect from the lm() and lmer() functions you have seen in the past. Here, we have some funny syntax – 0 + Intercept – only because we don’t have other predictors in the model. The benefit of brms is that this formula can easily handle complex and non-linear terms. We’ll be playing with more in future classes.

m42.1 <-
  brm(data = list(w = 6), 
      family = binomial(link = "identity"),
      w | trials(9) ~ 0 + Intercept,
      # this is a flat prior bounded by 0 and 1
      prior(beta(1, 1), class = b, lb = 0, ub = 1),
      iter = 5000, warmup = 1000, chains = 4,
      seed = 3,
      file = here("fits/m42.1"))

Here we set our priors. In this case, we only have one prior (on the Intercept). Class b refers to population-level parameters (sometimes called fixed effects). Again, this argument has the ability to become very detailed, specific, and flexible, and we’ll play more with this.

m42.1 <-
  brm(data = list(w = 6), 
      family = binomial(link = "identity"),
      w | trials(9) ~ 0 + Intercept,
      # this is a flat prior bounded by 0 and 1
      prior(beta(1, 1), class = b, lb = 0, ub = 1),
      iter = 5000, warmup = 1000, chains = 4,
      seed = 3,
      file = here("fits/m42.1"))

Hamiltonian MCMC runs for a set number of iterations, throws away the first bit (the warmup), and does that up multiple times (the number of chains).

m42.1 <-
  brm(data = list(w = 6), 
      family = binomial(link = "identity"),
      w | trials(9) ~ 0 + Intercept,
      # this is a flat prior bounded by 0 and 1
      prior(beta(1, 1), class = b, lb = 0, ub = 1),
      iter = 5000, warmup = 1000, chains = 4,
      seed = 3,
      file = here("fits/m42.1"))

Remember, these are random walks through parameter space, so set a seed for reproducbility. Also, these can take a while to run, especially when you are developing more complex models. If you specify a file, the output of the model will automatically be saved. Even better, then next time you run this code, R will check for that file and load it into your workspace instead of re-running the model. (Just be sure to delete the model file if you make changes to any other part of the code.)

summary(m42.1)
 Family: binomial 
  Links: mu = identity 
Formula: w | trials(9) ~ 0 + Intercept 
   Data: list(w = 6) (Number of observations: 1) 
  Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
         total post-warmup draws = 16000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.64      0.14     0.35     0.88 1.00     6037     7377

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
plot(m42.1)

Let’s sample from the posterior. First, get_variables() will tell us everything at our disposal.

get_variables(m42.1)
[1] "b_Intercept"   "lprior"        "lp__"          "accept_stat__"
[5] "stepsize__"    "treedepth__"   "n_leapfrog__"  "divergent__"  
[9] "energy__"     

The intercept is the variable we’re looking to sample here. We can use the spread_draws() function to do so.

p42.1 <- m42.1 %>% 
  spread_draws(b_Intercept, ndraws = 1e4, seed = 123)
dim(p42.1)
[1] 10000     4
head(p42.1)
# A tibble: 6 × 4
  .chain .iteration .draw b_Intercept
   <int>      <int> <int>       <dbl>
1      1       2463  2463       0.477
2      1       2511  2511       0.618
3      3       2419 10419       0.749
4      3        718  8718       0.750
5      4        483 12483       0.572
6      1       2986  2986       0.609
Code
p42.1 %>% 
  ggplot(aes(x = b_Intercept)) +
  geom_density(fill = "#1c5253", color = "white") +
  labs(
    title = "Posterior probability",
    x = "probabilty of water"
  ) + 
  scale_y_continuous(NULL, breaks = NULL)

linear models

Let’s return to the height and weight data.

data(Howell1, package = "rethinking")
d <- Howell1
library(measurements)
d$height <- conv_unit(d$height, from = "cm", to = "feet")
d$weight <- conv_unit(d$weight, from = "kg", to = "lbs")
describe(d, fast = T)
       vars   n  mean    sd median  min    max range  skew kurtosis   se
height    1 544  4.54  0.91   4.88 1.77   5.88   4.1 -1.26     0.58 0.04
weight    2 544 78.51 32.45  88.31 9.37 138.87 129.5 -0.54    -0.94 1.39
age       3 544 29.34 20.75  27.00 0.00  88.00  88.0  0.49    -0.56 0.89
male      4 544  0.47  0.50   0.00 0.00   1.00   1.0  0.11    -1.99 0.02
d <- d[d$age >= 18, ]
d$height_c <- d$height - mean(d$height)

We’ll refit our model that predicts weight from height.

m42.2p <- brm(
  data = d, 
  family = gaussian,
  weight ~ height_c,
  prior = c( prior( normal(130,20), class = Intercept),
             prior( normal(0,25), class = b),
             prior( uniform(0,50), class = sigma, ub = 50)
    ), 
  iter = 5000, warmup = 1000,
  seed = 3, 
  sample_prior = "only")

We’ll refit our model that predicts weight from height. Before we fit this to data, we’ll start by only sampling from our priors.

m42.2p <- brm(
  data = d, 
  family = gaussian,
  weight ~ height_c,
  prior = c( prior( normal(130,20), class = Intercept),
             prior( normal(0,25), class = b),
             prior( uniform(0,50), class = sigma, ub = 50)
    ), 
  iter = 5000, warmup = 1000,
  seed = 3, 
  sample_prior = "only")

Again, let’s see what variables are available.

get_variables(m42.2p)
 [1] "b_Intercept"   "b_height_c"    "sigma"         "Intercept"    
 [5] "lprior"        "lp__"          "accept_stat__" "stepsize__"   
 [9] "treedepth__"   "n_leapfrog__"  "divergent__"   "energy__"     

I want to sample from our priors, but all of them at the same time.

p42.2p <- m42.2p %>% 
  spread_draws(b_Intercept, b_height_c, sigma)
head(p42.2p)
# A tibble: 6 × 6
  .chain .iteration .draw b_Intercept b_height_c sigma
   <int>      <int> <int>       <dbl>      <dbl> <dbl>
1      1          1     1        144.       17.0  31.2
2      1          2     2        116.      -22.1  18.4
3      1          3     3        139.      -37.5  32.0
4      1          4     4        158.      -23.6  10.9
5      1          5     5        102.       20.3  35.9
6      1          6     6        139.       13.7  32.7

We’ll plot the regression lines from the priors against the real data, to see if they make sense.

Code
labels = seq(4, 6, by = .5)
breaks = labels - mean(d$height)
d %>% 
  ggplot(aes(x = height_c, y = weight)) + 
  geom_blank()+
  geom_abline(aes( intercept=b_Intercept, slope=b_height_c), 
              data = p42.2p[1:50, ], #first 50 draws only
              color = "#1c5253",
              alpha = .3) +
  scale_x_continuous("height(feet)", breaks = breaks, labels = labels) +
  scale_y_continuous("weight(lbs)", limits = c(50,150))

Let’s see if we can improve upon this model. One thing we know for sure is that the relationship between height and weight is positive. We may not know the exact magnitude, but we can use a distribution that doesn’t go below zero. We’ve already discussed uniform distributions, but those are pretty uninformative – they won’t do a good job regularizing – and we can also run into trouble if our bounds are not inclusive enough.

The log-normal distribution would be a good option here.

Code
set.seed(4)

tibble(b = rlnorm(1e4, mean = 0, sd = 1)) %>% 
  ggplot(aes(x = b)) +
  geom_density(fill = "grey92") +
  coord_cartesian(xlim = c(0, 5)) +
  labs(title = "Log-Normal(0,1)")

The log-normal is the distribution whose logarithm is normally distributed.

Code
set.seed(4)

tibble(rnorm           = rnorm(1e5, mean = 0, sd = 1),
       `log(rlognorm)` = log(rlnorm(1e5, mean = 0, sd = 1))) %>% 
  pivot_longer(everything()) %>% 

  ggplot(aes(x = value)) +
  geom_density(fill = "grey92") +
  coord_cartesian(xlim = c(-3, 3)) +
  facet_wrap(~ name, nrow = 2)

Let’s try this new prior. Play around with the plot code to find parameters that you think are reasonable. I’m going to use 1,2.

m42.2p <- brm(
  data = d, 
  family = gaussian,
  weight ~ height_c,
  prior = c( prior( normal(130,20), class = Intercept),
             prior( lognormal(1,2), class = b),
             prior( uniform(0,50), class = sigma, ub = 50)
    ), 
  iter = 5000, warmup = 1000,
  seed = 3, 
  sample_prior = "only")
Code
p42.2p <- m42.2p %>% 
  spread_draws(b_Intercept, b_height_c, sigma)
d %>% 
  ggplot(aes(x = height_c, y = weight)) + 
   geom_blank()+
  geom_abline(aes( intercept=b_Intercept, slope=b_height_c), 
              data = p42.2p[1:50, ], #first 50 draws only
              color = "#1c5253",
              alpha = .3) +
  scale_x_continuous("height(feet)", breaks = breaks, labels = labels) +
  scale_y_continuous("weight(lbs)", limits = c(50,150))

Applied to our dataset:

Code
m42.2 <- brm(
  data = d, 
  family = gaussian,
  weight ~ height_c,
  prior = c( prior( normal(130,20), class = Intercept),
             prior( lognormal(1,2), class = b),
             prior( uniform(0,50), class = sigma, ub = 50)
    ), 
  iter = 5000, warmup = 1000,
  seed = 3,
  file = here("fits/m42.2"))


summary(m42.2)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: weight ~ height_c 
   Data: d (Number of observations: 352) 
  Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
         total post-warmup draws = 16000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    99.20      0.50    98.22   100.18 1.00    13657    11564
height_c     42.16      2.01    38.26    46.07 1.00    17384    12434

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     9.39      0.36     8.72    10.13 1.00    15405    11446

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Let’s return to the tidybayes functions for summaries. As a reminder, we already saw spread_draws()

m42.2 %>% 
  spread_draws(b_Intercept, b_height_c, sigma) %>% 
  sample_n(5)
# A tibble: 5 × 6
  .chain .iteration .draw b_Intercept b_height_c sigma
   <int>      <int> <int>       <dbl>      <dbl> <dbl>
1      3       3947 11947        99.8       44.2 10.0 
2      3       1224  9224        98.7       42.2  9.06
3      4        875 12875       100.        39.8  9.36
4      2       3513  7513        99.9       41.9  9.61
5      4       2189 14189        99.1       42.0  9.78

We also have gather_draws():

m42.2 %>% 
  gather_draws(b_Intercept, b_height_c, sigma)   %>% 
  sample_n(2)
# A tibble: 6 × 5
# Groups:   .variable [3]
  .chain .iteration .draw .variable   .value
   <int>      <int> <int> <chr>        <dbl>
1      2       1207  5207 b_Intercept  99.4 
2      3       1932  9932 b_Intercept  98.7 
3      4        920 12920 b_height_c   42.3 
4      3       3819 11819 b_height_c   44.6 
5      3        294  8294 sigma         9.23
6      4       1107 13107 sigma         9.81

What is the difference between these?

gather_draws() is a useful function if we’re thinking about summarizing the results of our models.

m42.2 %>% 
  gather_draws(b_Intercept, b_height_c, sigma) %>% 
  median_qi()
# A tibble: 3 × 7
  .variable   .value .lower .upper .width .point .interval
  <chr>        <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
1 b_height_c   42.1   38.3    46.1   0.95 median qi       
2 b_Intercept  99.2   98.2   100.    0.95 median qi       
3 sigma         9.38   8.72   10.1   0.95 median qi       
m42.2 %>% 
  gather_draws(b_Intercept, b_height_c, sigma) %>% 
  ggplot(aes(x = .value, y=.variable)) +
  stat_halfeye()